TADMaster: a comprehensive web-based tool for the analysis of topologically associated domains

Background Chromosome conformation capture and its derivatives have provided substantial genetic data for understanding how chromatin self-organizes. These techniques have identified regions of high intrasequence interactions called topologically associated domains (TADs). TADs are structural and functional units that shape chromosomes and influence genomic expression. Many of these domains differ across cell development and can be impacted by diseases. Thus, analysis of the identified domains can provide insight into genome regulation. Hence, there are many approaches to identifying such domains across many cell lines. Despite the availability of multiple tools for TAD detection, TAD callers' speed, flexibility, result inconsistency, and reproducibility remain challenges in this research area. Results In this work, we developed a computational webserver called TADMaster that provides an analysis suite to directly evaluate the concordance level and robustness of two or more TAD data on any given genome region. The suite provides multiple visual and quantitative metrics to compare the identified domains' number, size, and various comparisons of shared domains, domain boundaries, and domain overlap. Conclusions TADMaster is an efficient and easy-to-use web application that provides a set of consensus and unique TADs to inform the choice of TADs. It can be accessed at http://tadmaster.io and is also available as a containerized application that can be deployed and run locally on any platform or operating system. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-05020-2.


Introduction
TADMaster is a web-based application that provides visualization/analysis of topologically associated domains (TADs) and provides comprehensive comparison of the methods that identify them. TADMaster has two pipelines. The "TADMaster" pipeline provides an analysis suite for comparing two TADs datasets. The "TADMaster Plus" pipeline which takes a user-provided contact matrix and utilizes state-of-the-art normalization and TAD identification methods in order to generate TADs datasets. This pipeline also includes the TADMaster TAD analysis on the resulting datasets. This document will provide further details of the implementation, parameter settings for algorithms, analysis, and estimated run times.

Hi-C Data Formats
TADMasterPlus accepts the two most prevalent Hi-C data formats: sparse matrices, where the format of the matrix is bin1, bin2, then interaction frequency, and the n-by-n matrix originally used by Lieberman et al. [1,8]. The cool format [2] which includes more data than on a traditional sparse matrix, such as edge weights. The hierarchal data format version 5 (.h5) [3] which stores a broad spectrum of data while being optimized for storage. Finally, the .hic format [4] that is a compressed binary format allowing for storage of multiple contact matrices efficiently.

Normalization
Hi-C data that is submitted can be normalized to reduce noise and bias. The normalization is performed on sparse matrix that is either provided or extracted from the supplied input. For TAD Methods that require a non-sparse or n-by-n format the normalization is handled by the TAD method. A brief overview of each supported normalization is shown below:

Sequential component normalization (SCN) [5]
SCN is the method of going through the matrix and reading in values and determining probabilities. These probabilities are expressed as values between 0 and 1, being real non-negative numbers.

Iterative correction and eigenvector decomposition (ICE) [6]
ICE decomposes contact maps into a matrix of biases and contact probabilities. It also assumes all loci should have equal visibility, which improves the performance of modelling when applied across the matrix. From this a higher order organization can be developed.

Knight-Ruiz (KR) [7]
KR takes the non-negative input matrix and tries to find a proportion so the sums of all rows and columns are equal to 1. The values produced are chains of probabilities, or Markov Chain's, which are probabilities of an event based on prior events. With a double stochastic matrix, the chains are uniform.

Vanilla Coverage (VC) [8]
VC takes a value in a matrix and divided it by the sum of the column along with the sum of the row. It reduces noise by taking the matrix values and making them a product of the surrounding interaction frequencies, averaging them out.

Median contact frequency scaling (MCFS) [9]
MCFS takes values between two coordinates in the matrix. The median contact frequency is calculated for those two points, and the values in those coordinates are divided by the median contact frequency.

TAD Methods
Hi-C data can be analyzed by a multiple statistical, probabilistic, and clustering methods in order to identify TADs. A brief overview of each supported TAD method is shown below with the hyper parameters used. If a method only support specific Hi-C data formats and normalization methods, it is also noted:

Armatus [10]
TADs are acquired through dynamic programming identification of domains. These domains are across multiple resolutions, then derive to a consensus set of domains that span all resolutions. Armatus is run using an adjustable max gamma with the default set to 0.2. The step size is set to 0.025 and a minimum number of samples of 100 to compute the mean.

Arrowhead [11]
Domains are found using dynamic programming where the likelihood of a position being the corner of a domain is calculated. The peak values in the matrix are analyzed, with corresponding peaks being labelled as a domain. Arrowhead only works with .hic data format and can be normalized by either KR or VC. Arrowhead is run with the default parameters with the exception of ignoring sparce data inputs. The latter allows Arrowhead to compute results on a wider range of data inputs, however the resulting TADs may not be ideal.

CaTCH [12]
CaTCH stratifies a contact matrix into a series of smaller contact matrices by setting the insulation thresholds, or difference between interaction frequency thresholds. As the threshold amount decreases, borders between contacts regions decrease and thus a hierarchal structure of differentially insulated domains is created. CaTCH is run with the default parameters and searches for TADs within a reciprocal insulation range of 0.55 to 0.65.

ClusterTAD [13]
After normalizing the contact matrix, features for clustering are created from the length of the row and columns, however only one half of the matrix is considered due to it being symmetrical. Then a hierarchal cluster analysis is performed where TADs are derived from contact clusters. ClusterTAD is not run with any parameters, which results in clustering and re-clustering to find TADs. The best TADs based on the quality score are presented for analysis.

Insulation Score [14]
Insulation score sums the contact frequencies around a particular locus using a rectangular sliding window. TADs are determined by a fixed contact count threshold, where regions exceeding threshold are classified by TADs. Three thresholds are used to identify TADs by insulation score. These thresholds are 40, 50, and 60 percent of the maximum found insulation score value. This approach is similar to the default approach of TADTool [15].

DI [16]
Directional index is a simple statistical method for deriving the upstream or downstream interaction bias. TADs can be identified by strongly directionally biased loci. Similar to Insulation Score the default approach proposed by TADTool [15] was utilized; a threshold determines what is statistically significant and were set similar to 40, 50, and 60 percent of the maximum found directional index value.

GMAP [17]
Gmap normalizes the interaction frequency, and a Gaussian mixture model is applied to the matrix to identify intra and inter chromatin domains. Choosing a bin size, the algorithm performs a proportion test between the inter-domain contact count and intra-domain contact count. Block boundaries are established to identify tads. By default, GMAP only utilizes resolution of the dataset to identify hierarchical domains

HiCExplorer [18]
HiCExplorer uses high resolution contact matrices to highlights and identifies important boundaries in contact matrices that can be used to identify TADs using machine learning. HiCExplorer only accepts cool, h5, or hic data formats. Further, it is only compatible with KR and ICE normalization provided by the tools hicCorrectMatrix function. HiCExplorer provides seven tunable parameters for TAD identification, however TADMaster utilizes the default parameters.

HiCseg [19]
HiCseg utilizes segmentation of a Hi-C contact map, the matrix is divided into smaller domains where associated interaction frequencies are grouped. HiCseg is executed using a maximum number of change points set at 1000, the distribution of the data is set to a Gaussian distribution, and a block-diagonal is used for the model type.

IC-Finder [20]
IC-Finder Uses a hierarchal clustering method to analyze contact matrices for the locations of groups of interaction frequencies which are close together and share the same pattern. IC-Finder is executed using segmentation based on SigmaZero set to 5 to control the merging of two clusters.

SpectralTAD [21]
Using a sliding window-based approach, SpectralTAD performs a spectral clustering framework. TADs are identified using gaps between consecutive eigenvectors. SpectralTAD is executed using one level for TAD hierarch with the filtering based on the z-scores of eigenvector gaps.

TopDom [22]
A value is calculated based upon the average contact frequency, upstream and downstream, creating a curve that runs the length of the chromosome. It finds TAD boundaries from local minimums and filters false positives by using statistical analysis. TopDom is executed with and adjustable number of bins, and the default is set to 5.

Running Time Estimates
The TADMaster web server runs on a HP G7-DL980, Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz server with 120 Cores,1 Terabyte (TB) of RAM and with 40 TB of storage space.

TADMaster: Time from Job Submission to Visualization
The time it takes to upload a job with TADMaster is largely dependent on the size and format of the contact matrix. Jobs that do not include a contact matrix are typically completed within a couple minutes. If the job includes a large contact matrix file in a form that needs to be unpacked, such as the. mcool format, then the completion time can take between 10 to 30 minutes. The time it takes to visualize the results is also dependent on the number of TADs datasets and the size of the contact matrix. A sample of load times are provided in Supplementary Table S1 for human Embryonic Stem Cell (hESC) chromosomes contact matrices at 40KB resolution from Dixon et al. [16].

TADMaster Plus: Time from Job Submission to Visualization
The time it takes to upload a job with TADMaster Plus is largely dependent on the size and format of the contact matrix. Similar to TADMaster, the format of the contact matrix determines the time it takes to be unpacked for normalization. However, this is only a significant portion of run-time if few normalizations and TAD identification methods are selected. Further, the size and number of selected methods are the largest contributor to completion time. A sample of job runtimes with five normalization and ten TAD identification methods selected are provided in Table S2 for 40KB human Embryonic Stem Cell (hESC) [16]. On the table, methods not marked with an asterisk run in parallel; the time documented is approximately the time it takes to process the TAD method on five normalizations algorithms. The jobs marked with an asterisk are run serially on the server; the time documented is approximately the time it takes to process the TAD callers on a single normalization method. The jobs not run in parallel consume more computational resources, especially memory, and we found out that it is counterproductive to run these methods in parallel; hence, we execute them serially. Also, on the Table S2, we show the times it takes to normalize the input data uploaded for these datasets on all the five normalization algorithms that we provide.